An Empirical Study in Source Word Deletion for Phrase-Based Statistical Machine Translation

Authors

  • Chi-Ho Li
  • Hailei Zhang
  • Dongdong Zhang
  • Mu Li
  • Ming Zhou
Abstract

The treatment of ‘spurious’ source-language words is an important but often ignored problem in discussions of phrase-based SMT. This paper explains why the problem is important and why it is not trivial, and proposes three models for handling spurious source words. Experiments show that any of the source word deletion models improves a phrase-based system by at least 1.6 BLEU points, and that the most sophisticated model improves it by nearly 2 BLEU points. The paper also explores the impact of training data size and training data domain/genre on source word deletion.

Similar resources

Context Sensitive Word Deletion Model for Statistical Machine Translation

Word deletion (WD) errors can lead to poor comprehension of the meaning of source translated sentences in phrase-based statistical machine translation (SMT), and have a critical impact on the adequacy of the translation results generated by SMT systems. In this paper, first we classify the word deletion into two categories, wanted and unwanted word deletions. For these two kinds of word deletio...

Insertion and Deletion Models for Statistical Machine Translation

We investigate insertion and deletion models for hierarchical phrase-based statistical machine translation. Insertion and deletion models are designed as a means to avoid the omission of content words in the hypotheses. In our case, they are implemented as phrase-level feature functions which count the number of inserted or deleted words. An English word is considered inserted or deleted based ...

Phrase-Boundary Translation Model Using Shallow Syntactic Labels

Phrase-boundary model for statistical machine translation labels the rules with classes of boundary words on the target side phrases of training corpus. In this paper, we extend the phrase-boundary model using shallow syntactic labels including POS tags and chunk labels. With the priority of chunk labels, the proposed model names non-terminals with shallow syntactic labels on the boundaries of ...

Chunk-Based EBMT

Corpus-driven machine translation approaches such as Phrase-Based Statistical Machine Translation and Example-Based Machine Translation have been successful by using word alignment to find translation fragments for matched source parts in a bilingual training corpus. However, they still cannot properly deal with systematic translation for insertion or deletion words between two distant language...

Are Unaligned Words Important for Machine Translation?

In this paper, we deal with the problem of a large number of unaligned words in automatically learned word alignments for machine translation (MT). These unaligned words are the reason for ambiguous phrase pairs extracted by a statistical phrase-based MT system. In translation, this phrase ambiguity causes deletion and insertion errors. We present hard and optional deletion approaches to remove...

Publication date: 2008